Matching

Matching and subclassification

The experimental ideal allows us to assume full independence between treatment and the potential outcomes and covariates:

\[Y_i(0), Y_i(1), X_i \mathrel{\unicode{x2AEB}} D_i\]

But we can still recover the average treatment effect if we can assume that assignment is conditionally independent of the potential outcomes:

\[Y_i(0), Y_i(1) \mathrel{\unicode{x2AEB}} D_i | X_i\]

Regression controls attempt to meet this more relaxed condition, but methods like matching and subclassification may be able to do so more flexibly.

Subclassification

Death rates per 1,000 person-years. From: Causal Inference Mixtape Ch. 5
Smoking group Canada UK US
Non-smokers 20.2 11.3 13.5
Cigarettes 20.5 14.1 13.5
Cigars/pipes 35.5 20.7 17.4

Subclassification

Mean ages. From: Causal Inference Mixtape Ch. 5
Smoking group Canada UK US
Non-smokers 54.9 49.1 57.0
Cigarettes 50.5 49.8 53.2
Cigars/pipes 65.9 55.7 59.7

Subclassification

Death rates by age and smoking. From: Causal Inference Mixtape Ch. 5
Age group Death rates # of Cigarette smokers # of Pipe or cigar smokers
Age 20–40 20 65 10
Age 41–70 40 25 25
Age >71 60 10 65
Total 100 100

Subclassification


The unadjusted death rate for cigarette smokers is:

\[ 20 \times \frac{65}{100} + 40 \times \frac{25}{100} + 60 \times \frac{10}{100} = 29 \]

Their death rate adjusted to the pipe/cigar smokers’ age distribution is:

\[ 20 \times \frac{10}{100} + 40 \times \frac{25}{100} + 60 \times \frac{65}{100} = 51 \]
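These weighted averages are easy to verify in a few lines of base R (a quick check, using the counts from the table above):

```r
death_rates <- c(20, 40, 60)  # death rate in each age group
cig_n  <- c(65, 25, 10)       # cigarette smokers per age group
pipe_n <- c(10, 25, 65)       # pipe/cigar smokers per age group

# Unadjusted: weight the death rates by the cigarette smokers' age distribution
unadjusted <- sum(death_rates * cig_n) / sum(cig_n)

# Age-adjusted: re-weight by the pipe/cigar smokers' age distribution
adjusted <- sum(death_rates * pipe_n) / sum(pipe_n)

c(unadjusted, adjusted)  # 29 and 51, matching the calculations above
```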

The area of common support

  • Subclassification is one way to adjust for differences using weighting, but it has limits: what happens if we try to adjust for additional confounders, or group subjects into narrower age ranges?
Age and Gender Death rates # of Cigarette smokers # of Pipe or cigar smokers
Age 20 M 0 10 -
Age 20 F 1 - 1
Age 21 M 3 1 -
Age 21 F 0 7 2
Age 22 M 0 - 2
Age 22 F 1 9 1

We have no way to “re-weight” the death rates at some covariate values because one group or the other has no observations there.

Methods like subclassification only work when there’s sufficient overlap between treated and control units. This is sometimes referred to as the “area of common support” assumption.

The area of common support and ATT

  • One way to mitigate the common support problem is to estimate the ATT instead of the ATE. If the missing cells only appear in the treatment category, then we can still re-weight all of the non-empty treated cells and obtain a valid ATT estimate:
Age and Gender Death rates # of Cigarette smokers # of Pipe or cigar smokers
Age 20 M 0 10 1
Age 20 F 1 - 1
Age 21 M 3 1 1
Age 21 F 0 7 2
Age 22 M 0 - 2
Age 22 F 1 9 1

Note that, in some settings, the ATT is the more sensible estimand anyway: not everyone will necessarily take the “treatment”!

Matching

Matching is a related method for addressing confounding. In the simplest matching designs, we try to identify control units with similar values of all confounders for each treated unit, then compare outcomes.

\[ \text{ATT} = \frac{1}{N_T}\sum_{D_i=1} (Y_i - Y_{j(i)}) \]

(Where \(Y_{j(i)}\) is the matched control unit for treated unit \(i\))

If the conditional independence assumption holds, then the average difference between the treated unit and its matched control should be equal to the difference in potential outcomes for the treated units.

Matching

In cases where we have more than one match for a single treatment unit, we can average over the matches:

\[ \text{ATT} = \frac{1}{N_T}\sum_{D_i=1} \left(Y_i - \frac{1}{M} \sum^M_{m=1} Y_{j_m(i)} \right) \]

… Where \(M\) is the number of matched units for each treated unit \(i\)
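As a toy illustration (with made-up outcome values), each treated unit’s matches are averaged before differencing:

```r
# Two treated units, each with M = 2 matched control outcomes (hypothetical numbers)
y_treated <- c(10, 12)
matches   <- list(c(7, 9), c(11, 13))

# ATT: average of (treated outcome - mean of its matched controls)
att <- mean(mapply(function(y, m) y - mean(m), y_treated, matches))
att  # ((10 - 8) + (12 - 12)) / 2 = 1
```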

Matching

Moreover, we can still estimate the ATE with this approach by finding counterfactuals for all treated and control units.

\[ \text{ATE} = \dfrac{1}{N} \sum_{i=1}^N (2D_i - 1) \bigg [ Y_i - \bigg ( \dfrac{1}{M} \sum_{m=1}^M Y_{j_m(i)} \bigg ) \bigg ] \]

Matching

…So how do we identify matching units? In most studies, “exact” matching would be ideal, but it is infeasible because of the common support requirement (especially when we want to adjust for continuous confounders like age).

But we can use various methods to find similar units instead of exact matches.

Approximate Matching

For instance, we might use the Euclidean distance to compare vectors of covariates, then find the nearest match for each “point” (this is called nearest-neighbor matching).
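A minimal sketch of this idea, using made-up covariate matrices (rows are units, columns are standardized covariates):

```r
# Hypothetical covariate values for two treated and three control units
treated_X <- matrix(c(1.0, 0.5,
                      2.0, 1.5), ncol = 2, byrow = TRUE)
control_X <- matrix(c(0.9, 0.6,
                      3.0, 3.0,
                      2.1, 1.4), ncol = 2, byrow = TRUE)

# For each treated unit, find the index of the closest control
nearest_control <- apply(treated_X, 1, function(x) {
  d <- sqrt(colSums((t(control_X) - x)^2))  # Euclidean distance to every control
  which.min(d)
})
nearest_control  # control 1 matches treated 1; control 3 matches treated 2
```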

Approximate Matching

(Many nearest-neighbor matching studies will use the Mahalanobis distance instead, which is a metric that accounts for correlations between covariates)
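Base R’s `mahalanobis()` computes these distances directly; a sketch with simulated covariates (all values are illustrative):

```r
set.seed(42)
X <- matrix(rnorm(200), ncol = 2)         # covariate sample used to estimate the covariance
S <- cov(X)                               # covariance matrix captures covariate correlations

controls  <- matrix(rnorm(20), ncol = 2)  # ten hypothetical control units
x_treated <- c(0.5, -0.2)                 # one treated unit's covariates

d2 <- mahalanobis(controls, center = x_treated, cov = S)  # squared Mahalanobis distances
best_match <- which.min(d2)               # nearest control under the Mahalanobis metric
```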

Approximate Matching

The variations here are practically endless:

  • For nearest neighbor matching, we might increase or decrease the number of matches permitted for each unit.

  • We might also limit the distance between matched units so that, if nothing is close enough, we discard that match entirely.

  • We might pair an approximate matching method with additional weighting methods to account for dissimilarities between pairs.

Propensity scores

Instead of matching on values themselves, propensity score methods either re-weight or match observations based on a predicted propensity to receive treatment:

  • Estimate a model to predict being in the treatment group
  • Then either:
    • Weight cases based on their inverse probability of treatment.
    • Match based on similar propensities (although this is now considered bad practice)
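A minimal sketch of the weighting branch, on simulated data (variable names are illustrative): fit a treatment model, convert the fitted propensities into ATE weights, and estimate a weighted difference in means.

```r
set.seed(1)
x <- rnorm(500)                              # confounder
d <- rbinom(500, 1, plogis(x))               # treatment probability depends on x
y <- 1 + 2 * d + x + rnorm(500)              # outcome; true effect is 2

ps <- fitted(glm(d ~ x, family = binomial))  # estimated propensity scores
w  <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))   # inverse probability of treatment weights (ATE)

ipw_fit <- lm(y ~ d, weights = w)            # weighted difference in means
coef(ipw_fit)[["d"]]                         # should land near the true effect of 2
```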

Propensity scores: inverse probability weighting

From: https://www.andrewheiss.com/blog/2024/03/21/demystifying-ate-att-atu/

Propensity scores: inverse probability of treatment

From: https://www.andrewheiss.com/blog/2024/03/21/demystifying-ate-att-atu/

Propensity scores

In practice, propensity scores often work well, but in theory they can actually worsen bias if the treatment selection process is modeled incorrectly. Other methods are more robust to this kind of mis-specification.

Coarsened Exact Matching

Coarsened exact matching works by:

  • “Coarsening” the variables and pruning unmatched observations (usually requires some user input)
  • Conducting exact matching on the coarsened values
  • Re-weighting as needed to balance the groups
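The coarsening step is essentially binning; a hand-rolled sketch with `cut()` and hypothetical ages (MatchIt’s `method = 'cem'` automates the whole pipeline):

```r
# Coarsen a continuous confounder into user-chosen bins
age <- c(19, 24, 31, 45, 52, 68)
age_coarse <- cut(age, breaks = c(18, 30, 50, 70))
table(age_coarse)  # two units per bin; exact matching then runs on the bins, not raw age
```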

Lalonde (1986)

  • National Supported Work experiment: qualified applicants randomly assigned to either a treatment or control condition. Treatment group received a guaranteed job for 9-18 months with additional counseling and support
  • Very positive results!
                    (1)
(Intercept)    4554.802 ***
               (408.046)
treat          1794.343 **
               (632.854)
N               445
R2                0.018
logLik        -4542.741
AIC            9091.482
*** p < 0.001; ** p < 0.01; * p < 0.05.

Lalonde (1986)

  • Lalonde swapped the control group for a sample from the general population, and demonstrated that a similarly constructed study with observational data would fail to find the correct result, even after controlling for observed characteristics
               vs experimental control   vs population
(Intercept)    4554.802 ***              6984.170 ***
               (408.046)                 (360.710)
treat          1794.343 **               -635.026
               (632.854)                 (657.137)
N               445                       614
R2                0.018                     0.002
logLik        -4542.741                 -6346.371
AIC            9091.482                 12698.742
*** p < 0.001; ** p < 0.01; * p < 0.05.

Lalonde (1986)

(he tried a bunch of stuff, none of it worked)

Lalonde (1986)

Candidates for job training were unlike the general population. They weren’t even especially similar to people with similar income, education, age, race, etc.

The treatment group differed from the general population in their expected potential income. So a naive difference of means would be a combination of the impact of job training and the difference in potential outcomes with the general population.

\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{job training effect} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{differences between applicants and non-applicants}\]

Observed Imbalance

Variable      N    population (N = 429)   control (N = 260)   treated (N = 185)
age           874  25 (19, 35)            24 (19, 28)         25 (20, 29)
educ          874  11 (9, 12)             10 (9, 11)          11 (9, 12)
race          874
  black            87 (20%)               215 (83%)           156 (84%)
  hispan           61 (14%)               28 (11%)            11 (5.9%)
  white            281 (66%)              17 (6.5%)           18 (9.7%)
married       874  220 (51%)              40 (15%)            35 (19%)
nodegree      874  256 (60%)              217 (83%)           131 (71%)
income 1974   874  2,547 (0, 9,277)       0 (0, 279)          0 (0, 1,291)
income 1975   874  1,087 (0, 3,881)       0 (0, 655)          0 (0, 1,817)
Cells are Median (Q1, Q3) or n (%)

Imbalance

Control variables probably can’t fix this.

  • large observed differences often imply large unobserved differences
  • even if all confounding is observed and controlled for, regression controls assume a linear relationship between the confounders and the outcome.

Imbalance

Applied for program \(\rightarrow\) Earnings

Applied for program \(\leftarrow\) Education \(\rightarrow\) Earnings

[DAG: Education → Applied for program; Education → Earnings; Applied for program → Earnings]

Imbalance

dataFunction <- function(){
  N <- 1000
  education <- rpois(N, 3)                     # education (confounder)
  D <- rbinom(N, 1, arm::invlogit(education))  # applied to job program
  Y_1 <- 2 + education^2 + rnorm(N)            # potential outcome with training
  Y_0 <- 0 + education^2 + rnorm(N)            # potential outcome without training
  ATE <- mean(Y_1 - Y_0)                       # the true effect
  Y <- ifelse(D == 1, Y_1, Y_0)                # the observed outcomes
  df <- tibble("income" = Y, Y_0 = Y_0, Y_1 = Y_1,
               "treated" = factor(D, labels = c("Control", "Treated")),
               education)
  return(df)
}

set.seed(9999)
data<-dataFunction()
ggplot(data, aes(x = Y_1, y = treated)) +
  stat_halfeye() +
  theme_bw() +
  xlab("Potential outcome Y(1)")

mean(data$Y_1 - data$Y_0)
[1] 1.870983

Imbalance

Notably, since the education confounder has a non-linear effect on income, we don’t get a correct estimate even when we control for it linearly:

               no controls   ols + control
(Intercept)    2.721         -5.448
               (1.187)       (0.411)
treatedTreated 12.322        -2.020
               (1.259)       (0.455)
education                    7.084
                             (0.080)
Num.Obs.       1000          1000
actual ATE = 1.871

Exact matching

Exact matching on education gets us close to the correct answer:

ate <- mean(data$Y_1 - data$Y_0)

matched <- matchit(treated ~ education, data = data,
                   method = 'exact',
                   estimand = 'ATE')
mdat <- match.data(matched)

exact_match_fit <- lm(income ~ treated + education,
                      data = mdat, weights = weights)

modelsummary(list('no controls' = ols,
                  'ols + control' = ols_control,
                  'exact matching' = exact_match_fit),
             gof_map = c("nobs"),
             note = sprintf("actual ATE = %s", round(ate, digits = 3)))
               no controls   ols + control   exact matching
(Intercept)    2.721         -5.448          -5.446
               (1.187)       (0.411)         (0.335)
treatedTreated 12.322        -2.020          1.879
               (1.259)       (0.455)         (0.304)
education                    7.084           5.546
                             (0.080)         (0.071)
Num.Obs.       1000          1000            862
actual ATE = 1.871

Exact matching

In repeated simulations, the exact matching estimates are unbiased and have only slightly more variance than the actual average treatment effect:

set.seed(999)
fun <- function(){
  df <- dataFunction()  # re-using the data generating function
  ATE <- mean(df$Y_1 - df$Y_0)
  model <- lm(income ~ education + treated, data = df)
  matched <- matchit(treated ~ education, data = df, method = 'exact', estimand = 'ATE')
  mdat <- match.data(matched)

  fit1 <- lm(income ~ treated,
             data = mdat, weights = weights)
  res <- tibble('regression estimate' = coef(model)[3],
                "matched estimate" = coef(fit1)[2],
                "Average Treatment Effect" = ATE)
  return(res)
}

results <- replicate(500, fun(), simplify = FALSE) |>
  bind_rows() |>
  pivot_longer(cols = everything()) |>
  mutate(value = unlist(value))

Propensity scores

matched <- matchit(treated ~ education, data = data,
                   method = 'nearest',
                   replace = TRUE,
                   distance = 'glm')
mdat <- match.data(matched)
psm_fit <- lm(income ~ treated + education, data = mdat, weights = weights)

modelsummary(list('ols + control' = ols_control,
                  "exact matching" = exact_match_fit,
                  "psm" = psm_fit),
             gof_map = c("nobs"))
               ols + control   exact matching   psm
(Intercept)    -5.448          -5.446           -9.452
               (0.411)         (0.335)          (0.537)
treatedTreated -2.020          1.879            1.479
               (0.455)         (0.304)          (0.495)
education      7.084           5.546            7.243
               (0.080)         (0.071)          (0.077)
Num.Obs.       1000            862              965

Propensity scores

Coarsened Exact Matching

matched <- matchit(treated ~ education, data = data, method = 'cem',
                   distance = 'glm', estimand = "ATE")
mdat <- match.data(matched)
cem_fit <- lm(income ~ treated + education, data = mdat, weights = weights)

modelsummary(list('ols + control' = ols_control,
                  "exact matching" = exact_match_fit,
                  "psm" = psm_fit,
                  "cem" = cem_fit),
             gof_map = c("nobs"),
             note = sprintf("actual ATE = %s", round(ate, digits = 3)))
               ols + control   exact matching   psm      cem
(Intercept)    -5.448          -5.446           -9.452   -5.446
               (0.411)         (0.335)          (0.537)  (0.335)
treatedTreated -2.020          1.879            1.479    1.879
               (0.455)         (0.304)          (0.495)  (0.304)
education      7.084           5.546            7.243    5.546
               (0.080)         (0.071)          (0.077)  (0.071)
Num.Obs.       1000            862              965      862
actual ATE = 1.871

Standard errors

Methods like CEM group observations into strata before weighting, so we would expect to see clustering within the strata, but we know how to deal with this!


 Estimate 2.5 % 97.5 %
     1.88  1.28   2.48

Term: treated
Type: response
Comparison: Treated - Control

Takeaways

  • Matching can address some of the limitations of regression models and simulate experimental conditions.

  • Main advantages over regression are:

    • No assumption of a linear relationship between the confounders and the outcome
    • More informally: allows researchers to separate out the treatment selection process from the treatment effect process.